1: Overview

In this tutorial, we will learn about the package dplyr and how to apply it to data we want to run a statistical analysis on.

Learning objectives:

  • Gain a deeper understanding of dplyr

  • Apply the dplyr functions you learned to any dataset you are working with

  • Make inferences using ANOVA test results

2: The dplyr Package

The dplyr package is commonly used to manipulate data for analysis. Given the tagline assigned to this package, “A Grammar of Data Manipulation,” it is an analyst’s most valuable tool when processing data in R (Wickham et al.).

There are over 100 functions available, each aiding in the reconstruction of a data frame, BUT we will only be using the following to analyze our dataset:

 filter(): select specific rows

 mutate(): create new columns or modify within the same data frame

 rename(): change names of columns

 group_by(): group data based on a specified variable

 summarize(): usually paired with a function like `group_by()`; computes summary values for each group

 glimpse(): shows a brief summary of a dataset; for example, it displays the structure of each column, such as whether it is a character or double

 arrange(): reorder rows based on specified variable

   desc(): paired with the `arrange()` function; displays a column in descending order
(Data Bentobox)

2.1. Additional Features

We will need a couple of additional functions to help us manipulate the data frame provided. The separate() function from the tidyr package will be useful later on; it splits one column into several by a specified delimiter.

The piping operator %>% is part of the magrittr package and will be essential throughout the cleaning process. Instead of nesting function calls (and forgetting to close parentheses), the pipe passes the result of one step directly into the next, simplifying our code.
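As a quick sketch, on a made-up two-row data frame (not our tutorial data), here is the same separate() call written in nested form and in piped form:

```r
library(tidyr)    # separate()
library(magrittr) # %>%

df <- data.frame(Full_Name = c("Doe, John", "Smith, Alison"))

# Nested form: the data frame is buried inside the call
nested <- separate(df, Full_Name, into = c("Last", "First"), sep = ", ")

# Piped form: each step reads left to right
piped <- df %>%
  separate(Full_Name, into = c("Last", "First"), sep = ", ")
```

Both forms produce identical results; the pipe simply makes multi-step pipelines easier to read and extend.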

2.2. Sample Data

We will be using the following sample data to practice the new functions we have learned.

# Creating sample data
student_data <- data.frame(
  StudentID = c(1, 2, 3, 4, 5),
  Full_Name = c("Doe, John", "Smith, Alison", "Johnson, Robert", "Brown, Evelyn", "Smith, Jamie"),
  Math_Score = c(85, 72, 90, 65, 78),
  Science_Score = c(92, 88, 75, 80, 85),
  History_Score = c(78, 85, 70, 60, 92)
  
)

Now it is time to apply the functions. We will need to load the packages. If you have not installed them, do so now using install.packages("name_of_package").
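The attach messages shown below come from loading the packages; the exact calls are not echoed in the output, but judging from the messages they would look like this:

```r
library(dplyr)     # data-manipulation verbs (masks stats::filter and stats::lag)
library(tidyverse) # loads ggplot2, tidyr, readr, and friends
library(magrittr)  # the %>% pipe
```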

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.4
## ✔ ggplot2   3.4.4     ✔ stringr   1.5.0
## ✔ lubridate 1.9.2     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'magrittr'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     set_names
## 
## 
## The following object is masked from 'package:tidyr':
## 
##     extract

Using the piping operator %>%, we can chain all of these operations:

analyzed_data <- student_data %>%
  select(Full_Name, Math_Score, Science_Score, History_Score) %>% # did not want StudentID included
  rename(Math = Math_Score, Science = Science_Score, History = History_Score) %>% # new name on the left of '=', old name on the right
  mutate(total = Math + Science + History) %>% # using the new names now
  separate(Full_Name, into = c("Last_name", "First_name"), sep = ",") %>% # first name the variable to split, then the new columns
  group_by(Last_name) %>%
  summarize(avg_math = mean(Math)) %>%
  arrange(desc(avg_math)) %>%
  filter(avg_math > 70) %>%
  glimpse()
## Rows: 3
## Columns: 2
## $ Last_name <chr> "Johnson", "Doe", "Smith"
## $ avg_math  <dbl> 90, 85, 75

3: Data Description

Now that you have gotten a taste of these functions, it is time to apply them to a larger data frame. For this tutorial, we will clean and manipulate data from the U.S. Census Bureau website (Bureau).

The data provide a sample of 2022 payroll for state government positions. The following variables were provided:

  • State

  • Government position

  • Full-time Payroll

  • Count of Full-time Employees

  • Part-time Payroll

  • Count of Part-time Employees

  • Total Payroll

4: Exploring a Large Dataset

Where to begin?

Before I clean the data and decide on a research question, it is important to explore the data to see what variables I will be dealing with. I will do this by using the following base R functions: View(), nrow(), ncol(), and length(names()). These functions will assist with my exploration process and get me familiarized with the dataset I will be working with.

# loading the data into the script
library(readxl)
salaries <- read_excel("2022_state.xlsx")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
## • `` -> `...7`
## • `` -> `...8`
## • `` -> `...9`
salaries
## # A tibble: 1,952 × 9
##    STATE GOVERNMENT: EMPLOYMEN…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
##    <chr>                         <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
##  1 BY STATE AND BY FUNCTION: MA… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  2 <NA>                          <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  3 Preliminary Data Released: J… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  4 <NA>                          <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  5 <NA>                          <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  6 SOURCE: 2022 Census of Gover… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  7 https://www.census.gov/progr… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  8 their own estimates from the… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
##  9 <NA>                          <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 10 <NA>                          <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## # ℹ 1,942 more rows
## # ℹ abbreviated name: ¹​`STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA`
head(salaries)  # looking at the first few rows
## # A tibble: 6 × 9
##   STATE GOVERNMENT: EMPLOYMENT…¹ ...2  ...3  ...4  ...5  ...6  ...7  ...8  ...9 
##   <chr>                          <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 BY STATE AND BY FUNCTION: MAR… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 2 <NA>                           <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 3 Preliminary Data Released: Ju… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 4 <NA>                           <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 5 <NA>                           <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## 6 SOURCE: 2022 Census of Govern… <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA>  <NA> 
## # ℹ abbreviated name: ¹​`STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA`
nrow(salaries)/50  # average number of rows (positions) per state, across 50 states
## [1] 39.04
ncol(salaries)     #how many columns
## [1] 9

Note: Once the data was imported from Excel, the column names did not transfer properly, so we will need to rename the columns (the actual data starts around row 15). The number of rows is also far too large for analysis: there are 1,952 rows in total, about 39 per state, so we will condense and/or omit variables. Finally, to simplify results, we will focus on the total payroll for all forms of employment by filtering down to the necessary columns and rows.

4.1. Research Question and Cleaning Data

After exploring the data, I was able to develop my research question.

Which state government pays the highest salary? As I stated in the previous section, it is important to have a plan before tackling your data. To find the highest salary, I will use a graph for visualization and a table to look at the concrete numbers. After using these visualization tools, I want to ensure that the conclusion is statistically significant, so I will run an ANOVA test with alpha set to 0.05. If the p-value is less than 0.05, the result is considered significant.

We will be calling the tidyverse package to utilize both dplyr and ggplot2.

Now we will isolate the 1st, 2nd, 4th, and 6th columns and rows 15:1992, then give the columns descriptive names. The select(), slice(), and mutate() functions from the dplyr package will help us execute this task.

salaries %>%
  select("STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA", "...2", "...4", "...6") %>%
  slice(15:1992) %>%
  filter(...6 > 0) %>% # decided to filter out the positions that had a zero total
  mutate(full_time = "...4",
         part_time = "...6",
         states = "STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA",
         positions = "...2") %>%
  print()
## # A tibble: 1,106 × 8
##    STATE GOVERNMENT: EM…¹ ...2  ...4  ...6  full_time part_time states positions
##    <chr>                  <chr> <chr> <chr> <chr>     <chr>     <chr>  <chr>    
##  1 US                     Tota… 2438… 2441… ...4      ...6      STATE… ...2     
##  2 US                     Fina… 9903… 1735… ...4      ...6      STATE… ...2     
##  3 US                     Othe… 2929… 1770… ...4      ...6      STATE… ...2     
##  4 US                     Judi… 1150… 2064… ...4      ...6      STATE… ...2     
##  5 US                     Poli… 7606… 4653… ...4      ...6      STATE… ...2     
##  6 US                     Poli… 5661… 6704… ...4      ...6      STATE… ...2     
##  7 US                     Poli… 1944… 3982… ...4      ...6      STATE… ...2     
##  8 US                     Corr… 2323… 2295… ...4      ...6      STATE… ...2     
##  9 US                     High… 1149… 1861… ...4      ...6      STATE… ...2     
## 10 US                     Air … 2546… 57589 ...4      ...6      STATE… ...2     
## # ℹ 1,096 more rows
## # ℹ abbreviated name: ¹​`STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA`

We ran into a problem when mutating the columns: because each new column was set equal to a quoted string, mutate() assigned that string to every row, replacing the numeric values.

Let’s try this again…

So rename() is the function that renames a column. ALWAYS remember that mutate() manipulates the values in the rows of a column.
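A tiny contrast, on a hypothetical one-column data frame, of what each verb actually touches:

```r
library(dplyr)

df <- data.frame(x = c(1, 2, 3))

renamed  <- rename(df, score = x)         # changes the column NAME only; values untouched
mutated  <- mutate(df, x = "score")       # overwrites every VALUE in x with the string "score"
computed <- mutate(df, double_x = x * 2)  # the intended use: compute new values row by row
```

This is exactly the mistake above: quoting a name inside mutate() assigns that string to the whole column instead of renaming it.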

payroll <- salaries %>%
  select("STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA", "...2", "...4", "...6") %>%
  slice(15:1992) %>%
  rename(
    full_time = "...4",
    part_time = "...6",
    states = "STATE GOVERNMENT: EMPLOYMENT AND PAYROLL DATA",
    positions = "...2"
  ) %>%
  filter(
    full_time != 0,
    part_time != 0,           # also decided to filter out 0s in each column
    !positions %in% c(        # ... and the subtotal rows
      "Total - All Government Employment Functions",
      "Police Protection Total",
      "Fire Protection Total",
      "Education Total",
      "Education - Elementary and Secondary Total",
      "Education - Higher Education Total"
    )
  )


print(payroll)
## # A tibble: 898 × 4
##    states positions                                        full_time  part_time
##    <chr>  <chr>                                            <chr>      <chr>    
##  1 US     Financial Administration                         990359836  17356080 
##  2 US     Other Government Administration                  292909004  17702042 
##  3 US     Judicial and Legal                               1150539087 20648400 
##  4 US     Police Protection - Persons with Power of Arrest 566174779  670411   
##  5 US     Police Protection - Other                        194445002  3982950  
##  6 US     Corrections                                      2323903777 22956231 
##  7 US     Highways                                         1149897906 18612575 
##  8 US     Air Transportation                               25469702   57589    
##  9 US     Sea and Inland Port Facilities                   39896158   109433   
## 10 US     Public Welfare                                   1131087361 21508957 
## # ℹ 888 more rows

After cleaning the data, now it’s time to draw some inferences.

join_salary <- payroll%>%
  group_by(states)%>%
  summarize(mean_salary = mean(n()))
print(join_salary)
## # A tibble: 51 × 2
##    states mean_salary
##    <chr>        <dbl>
##  1 AK              19
##  2 AL              19
##  3 AR              17
##  4 AZ              17
##  5 CA              18
##  6 CO              15
##  7 CT              18
##  8 DE              17
##  9 FL              18
## 10 GA              16
## # ℹ 41 more rows

We had trouble finding the mean because the salaries are being treated as characters, not doubles (the attempt above just returned the row count per group).

glimpse(payroll) # full_time and part_time salaries are characters
## Rows: 898
## Columns: 4
## $ states    <chr> "US", "US", "US", "US", "US", "US", "US", "US", "US", "US", …
## $ positions <chr> "Financial Administration", "Other Government Administration…
## $ full_time <chr> "990359836", "292909004", "1150539087", "566174779", "194445…
## $ part_time <chr> "17356080", "17702042", "20648400", "670411", "3982950", "22…
join_salaries <- payroll %>%
  mutate(
    full_time = as.numeric(full_time),
    part_time = as.numeric(part_time),
    total_pay = full_time + part_time
  ) %>%
  group_by(states) %>%
  summarize(mean_salary = mean(total_pay))
head(join_salaries)
## # A tibble: 6 × 2
##   states mean_salary
##   <chr>        <dbl>
## 1 AK        8079107.
## 2 AL       25724643.
## 3 AR       17215672.
## 4 AZ       24609665.
## 5 CA      213737235.
## 6 CO       36675423.
glimpse(join_salaries) # the salary values are now doubles
## Rows: 51
## Columns: 2
## $ states      <chr> "AK", "AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL", "GA"…
## $ mean_salary <dbl> 8079107, 25724643, 17215672, 24609665, 213737235, 36675423…

Now it is time to compute the mean and median of total_pay. We again add the full-time and part-time columns together.

# To include positions, we recompute total_pay from the payroll dataset
  
complete_pay <- payroll %>%
  mutate(
    full_time = as.numeric(full_time),
    part_time = as.numeric(part_time),
    total_pay = full_time + part_time
  )

position_calc_pay <- complete_pay %>%
  group_by(positions) %>%
  summarize(mean_pay = mean(total_pay, na.rm = TRUE), median_pay = median(total_pay)) %>%
  arrange(desc(mean_pay))
head(position_calc_pay)
## # A tibble: 6 × 3
##   positions                                    mean_pay median_pay
##   <chr>                                           <dbl>      <dbl>
## 1 Education - Higher Education Other         240348688. 100699491 
## 2 Education - Higher Education Instructional 216376762.  79736800 
## 3 Hospitals                                  108920803.  32047634 
## 4 Corrections                                 91765010.  24000538.
## 5 Transit                                     84731883.  44594895 
## 6 Highways                                    50517611.  15599646
state_calc_pay <- complete_pay %>%
  group_by(states) %>%
  summarize(mean_pay = mean(total_pay, na.rm = TRUE), median_pay = median(total_pay)) %>%
  arrange(desc(mean_pay))
head(state_calc_pay)
## # A tibble: 6 × 3
##   states   mean_pay median_pay
##   <chr>       <dbl>      <dbl>
## 1 US     924928786.  310611046
## 2 CA     213737235.  112976658
## 3 TX      99948947.   47047908
## 4 NY      82109160.   51256641
## 5 PA      55413687.   30355233
## 6 MI      54748541.   20324641

From the calculated means, it appears that California has the highest mean salary out of all of the states (US, which tops the table, is the nationwide aggregate rather than a state).

4.2. ggplot2

As part of the tidyverse, we will also be utilizing ggplot2 to provide visual aids for our analysis.

library(tidyverse)
# Scatterplots of the 50 states' mean salaries

ggplot(join_salaries, aes(states, mean_salary)) +
  geom_point() +
  labs(title = "Mean Salaries of AK-NH", x = "States", y = "Mean") +
  coord_cartesian(xlim = c(0, 30))

ggplot(join_salaries, aes(states, mean_salary)) +
  geom_point() +
  labs(title = "Mean Salaries of NH-WY", x = "States", y = "Mean") +
  coord_cartesian(xlim = c(30, 51))

# Box plots
p <- ggplot(data = complete_pay, mapping = aes(x = states, y = total_pay, color = states)) +
  geom_boxplot(show.legend = FALSE) + # California looks the highest
  labs(title = "Total Pay of Each State", x = "States", y = "Total Pay")
p

Since there are so many states, it is difficult for the end user to see all of the points. In the scatterplots we tried coord_cartesian(xlim = c(0, 30)), which lets us restrict the range of the x-axis, but as shown, we needed at least two intervals to cover all the states. For a cleaner look, the plotly library was a great addition to our presentation.

First, I assigned the plot a name. Then I called the ggplotly() function to convert the ggplot2 object to plotly. This feature allows the end user to interact with the data.

library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(p)

4.3. Using more Data to Confirm Conclusion

California can be expected to have the highest mean salary because it may have the highest cost of living. Using the data from the advisorsmith.com article “City Cost of Living Index,” we can check this conclusion (Mak).

## # A tibble: 502 × 5
##      Lat   Lon City       states Cost_of_Living_Index
##    <dbl> <dbl> <chr>      <chr>                 <dbl>
##  1  37.1 -88.6 Paducah    " KY"                  85.7
##  2  33.5 -86.8 Birmingham " AL"                  90.7
##  3  34.2 -86.8 Cullman    " AL"                  90.5
##  4  30.6 -87.9 Daphne     " AL"                  96.6
##  5  34.6 -87.0 Decatur    " AL"                  88.6
##  6  31.2 -85.4 Dothan     " AL"                  86.3
##  7  34.8 -87.7 Florence   " AL"                  86.8
##  8  34.0 -86.0 Gadsden    " AL"                  86  
##  9  34.7 -86.6 Huntsville " AL"                  91.3
## 10  30.7 -88.0 Mobile     " AL"                  88.3
## # ℹ 492 more rows
var_data %>%
  group_by(states) %>%
  plot_geo(x = ~Lon, y = ~Lat) %>%
  add_markers(size = ~Cost_of_Living_Index, text = ~Cost_of_Living_Index) %>% # marker size is determined by index value
  layout(title = "Cost of Living Index by US City")
## Warning: `line.width` does not currently support multiple values.
mean_cost <- var_data %>%
  group_by(states) %>%
  summarize(mean_cost_of_living = mean(Cost_of_Living_Index)) %>%
  arrange(desc(mean_cost_of_living)) %>%
  print()
## # A tibble: 51 × 2
##    states mean_cost_of_living
##    <chr>                <dbl>
##  1 " HI"                 151.
##  2 " CA"                 126.
##  3 " DC"                 120.
##  4 " MA"                 115.
##  5 " VT"                 113.
##  6 " RI"                 112.
##  7 " AK"                 109.
##  8 " CT"                 109.
##  9 " CO"                 108.
## 10 " WA"                 107.
## # ℹ 41 more rows

From this table, we can see that California has the second-highest cost of living. This further explains why its mean_salary is expected to be so high.

plot <- ggplot(var_data, aes(y = Cost_of_Living_Index, x = states, color = states)) +
  geom_boxplot(show.legend = FALSE) +
  ggtitle("Cost of Living Index in Each State")
  
library(plotly)
ggplotly(plot)

5: Statistical Analysis

In order to confirm that our descriptive statistics are statistically significant, we will need to run tests. The common tests that can be run in R are the t-test, linear regression, the chi-squared test, and ANOVA.
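For reference, each of these tests is available in base R. A quick sketch on the built-in sleep and mtcars datasets (not our payroll data):

```r
t.test(extra ~ group, data = sleep)             # two-sample t-test
lm(mpg ~ wt, data = mtcars)                     # linear regression
chisq.test(table(mtcars$cyl, mtcars$am))        # chi-squared test of independence
summary(aov(mpg ~ factor(cyl), data = mtcars))  # one-way ANOVA
```

Each function returns a model or test object whose summary() reports the relevant statistic and p-value.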

We will be running a two-way ANOVA test without interaction between the variables. This is an appropriate test for our data because we want to know the relationship between the dependent variable (total_pay) and two categorical independent variables (states and positions). Once we run the ANOVA test, we will look at the p-values. If a p-value is under .05, then we claim that the corresponding group means differ. This indicates that where the position is located and the title of the job have an impact on the salary.

# Two-way ANOVA test
sal_aov <- aov(total_pay ~ states + positions, complete_pay)
summary(sal_aov)
##              Df    Sum Sq   Mean Sq F value   Pr(>F)    
## states       50 2.344e+19 4.687e+17   6.250  < 2e-16 ***
## positions    28 6.411e+18 2.290e+17   3.053 2.73e-07 ***
## Residuals   819 6.142e+19 7.499e+16                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Two-way ANOVA test without the US aggregate

w_o_US_pay <- payroll %>%
  mutate(
    full_time = as.numeric(full_time),
    part_time = as.numeric(part_time),
    total_pay = full_time + part_time
  ) %>%
  filter(states != "US")
w_o_US_aov <- aov(total_pay ~ states + positions, w_o_US_pay)
summary(w_o_US_aov)
##              Df    Sum Sq   Mean Sq F value Pr(>F)    
## states       49 9.645e+17 1.968e+16    9.46 <2e-16 ***
## positions    28 1.032e+18 3.687e+16   17.72 <2e-16 ***
## Residuals   791 1.646e+18 2.081e+15                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Both p-values are under .05, which means we can reject the null hypothesis: the state and the position have an impact on an individual’s payroll in the year 2022. Another thing to pay attention to is that US (the sum across all states) was included in the first two-way test that was run. The residual variance decreased once US was removed. This makes sense because the US row aggregates a lot of variation across the states, so including it would mask an individual state’s influence on salary.

Conclusion

From this tutorial, you can see how dplyr plays an extensive role in data analytics. As shown earlier, it would be difficult to catch underlying problems without the glimpse() function, and we learned a lot about our data by grouping it with group_by() and summarize(). Before using any data visualization, we knew that California was going to have one of the highest mean salaries in the group, and we discovered later on that this was because of its cost of living. More intriguingly, Hawaii actually had the highest cost of living, with California second. That came as a surprise, because Hawaii was not even in the top 5 mean salaries. We could gather more data to see which jobs are more prevalent and/or lucrative in the state of Hawaii. Referring back to our ANOVA test, both location and position tend to influence pay.

Appendix A: References

Function reference

Data frame verbs

Rows

Verbs that principally operate on rows.

arrange(): Order rows using column values

distinct(): Keep distinct/unique rows

filter(): Keep rows that match a condition

slice() slice_head() slice_tail() slice_min() slice_max() slice_sample() Subset rows using their positions

Columns

Verbs that principally operate on columns.

glimpse(): Get a glimpse of your data

mutate(): Create, modify, and delete columns

pull(): Extract a single column

relocate(): Change column order

rename() rename_with(): Rename columns

select(): Keep or drop columns using their names and types

Groups: Verbs that principally operate on groups of rows.

count() tally() add_count() add_tally(): Count the observations in each group

group_by() ungroup(): Group by one or more variables

dplyr_by: Per-operation grouping with

.by/by rowwise(): Group input by rows

summarise() summarize(): Summarise each group down to one row

reframe(): Transform each group to an arbitrary number of rows

n() cur_group() cur_group_id() cur_group_rows() cur_column(): Information about the “current” group or variable

Data frames

Verbs that principally operate on pairs of data frames.

bind_cols(): Bind multiple data frames by column

bind_rows(): Bind multiple data frames by row

intersect() union() union_all() setdiff() setequal() symdiff(): Set operations

inner_join() left_join() right_join() full_join(): Mutating joins

nest_join(): Nest join

semi_join() anti_join(): Filtering joins

cross_join(): Cross join

join_by(): Join specifications

rows_insert() rows_append() rows_update() rows_patch() rows_upsert() rows_delete(): Manipulate individual rows

Multiple columns

Pair these functions with mutate(), summarise(), filter(), and group_by() to operate on multiple columns simultaneously.

across() if_any() if_all(): Apply a function (or functions) across multiple columns c_across(): Combine values from multiple columns pick(): Select a subset of columns

Vector functions

Unlike other dplyr functions, these functions work on individual vectors, not data frames.

between(): Detect where values fall in a specified range

case_match(): A general vectorised

switch() case_when(): A general vectorised if-else

coalesce(): Find the first non-missing element

consecutive_id(): Generate a unique identifier for consecutive combinations

cumall() cumany() cummean(): Cumulative versions of any, all, and mean

desc(): Descending order

if_else(): Vectorised if-else

lag() lead(): Compute lagged or leading values

n_distinct(): Count unique combinations

na_if(): Convert values to NA

near(): Compare two numeric vectors

nth() first() last(): Extract the first, last, or nth value from a vector

ntile(): Bucket a numeric vector into n groups

order_by(): A helper function for ordering window function output

percent_rank() cume_dist(): Proportional ranking functions

recode() recode_factor(): Recode values

row_number() min_rank() dense_rank(): Integer ranking functions

Built in datasets

band_members band_instruments band_instruments2: Band membership

starwars: Starwars characters storms: Storm tracks data

Grouping helpers

This (mostly) experimental family of functions are used to manipulate groups in various ways.

group_cols(): Select grouping variables

group_map() group_modify() group_walk(): Apply a function to each group

group_trim(): Trim grouping structure

Superseded

Superseded functions have been replaced by new approaches that we believe to be superior, but we don’t want to force you to change until you’re ready, so the existing functions will stay around for several years.

sample_n() sample_frac(): Sample n rows from a table

top_n() top_frac(): Select top (or bottom) n rows (by value)

scoped: Operate on a selection of variables

all_vars() any_vars(): Apply predicate to all variables

vars(): Select variables with_groups(): Perform an operation with temporary groups

Remote tables

auto_copy(): Copy tables to same source, if necessary

compute() collect() collapse(): Force computation of a database query

copy_to(): Copy a local data frame to a remote src

ident(): Flag a character vector as SQL identifiers

explain() show_query(): Explain details of a tbl

tbl() is.tbl(): Create a table from a data source

sql(): SQL escaping.

Works Cited

Bureau, US Census. “2022 Census of Governments, Survey of Public Employment & Payroll Datasets & Tables.” Census.gov, www.census.gov/data/datasets/2022/econ/apes/2022.html. Accessed 7 Dec. 2023.

Dwi Cahyaningtyas, Rany . “RPubs - R Markdown Themes Guide: Unleash Your Creativity with Captivating Document Styles.” Rpubs.com, 28 May 2023, rpubs.com/ranydc/rmarkdown_themes. Accessed 7 Dec. 2023.

Mak, Adrian. “U.S. Cost of Living Index by City: Downloadable Data – AdvisorSmith.” Advisorsmith.com, 2021, advisorsmith.com/data/coli/.

Sievert, Carson. Interactive Web-Based Data Visualization with R, Plotly, and Shiny. Chapman and Hall/CRC, 2020, plotly-r.com.

Wickham, Hadley. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016, ggplot2.tidyverse.org.

Wickham, Hadley, et al. Dplyr: A Grammar of Data Manipulation. 2023, CRAN.R-project.org/package=dplyr.

Data Bentobox, https://www.databentobox.com/2019/11/01/Dplyr-With-String/Featured.png.

dplyr Function Reference, https://dplyr.tidyverse.org/reference/index.html#vector-functions.